1. INTRODUCTION

Movies, online videos, and television are among the most popular sources of entertainment across the globe,
especially in India [1]. The movie industry involves huge investments of money, time, and effort [2], and it
produces hundreds of movies every year. It is therefore crucial to predict the success of a movie at an early
stage. The success or failure of a movie depends on multiple factors, and a huge amount of movie-related
information is available on the web: actors, directors, critic reviews, user reviews, ratings, writers, budget,
genre, Facebook likes, number of YouTube views of the movie trailer, and fan following on Twitter.

The success of a movie in this era depends on the revenue generated in the first few weeks [3]. Revenue
in these initial weeks is greatly influenced by online reviews and ratings of the movie. Because the first few
weeks are so crucial, production teams put a lot of effort into publicity and into shaping public opinion. In
this work, we aim to use the available information to predict the success rate of a movie at an early stage.
The Internet Movie Database (IMDb) is a rich source of information that contains data about almost all movies.

To predict the success of a movie, supervised machine learning algorithms are used. Different machine
learning algorithms are used to build the prediction model, and the results obtained from each model are
compared using root mean squared error (RMSE), mean absolute error (MAE), and R2 score. Further, social
media such as Twitter, Instagram, and Facebook have become a great source of influence on people’s opinions.
A huge amount of data is generated through such sources, and they are an important means of gathering movie
reviews. Opinions about movies are mostly expressed in the form of reviews and comments. However, this huge
amount of data presents a challenge of information overload, which creates the need to automate the extraction
of sentiment from the available information. In this work, a sentiment analysis model is developed using
machine learning and natural language processing to analyse reviews and comments and to predict the sentiment
of tweets and reviews. The overall sentiment of each review or tweet is classified into one of two classes,
positive or negative. Different machine learning algorithms are applied for sentiment analysis, and the results
are compared to find the model best suited to movie success rate prediction. Once the two models (the
prediction model and the sentiment analysis model) are built, the hybrid success rating is predicted using
both models together.

A lot of work on movie rating prediction and on sentiment analysis of reviews can be found in the
literature, specifically in the movie domain. Most of this work uses machine learning and natural language
processing to some extent. Work focusing on predicting the polarity of movie reviews includes the following.
Turney [4] used unsupervised learning to classify reviews based on their average semantic orientation. Pang
et al. [5] compared the performance of machine learning algorithms on movie reviews and found that the
support vector machine (SVM) classifier performed best. Mishne and Glance [6] used blogger sentiment to
predict movie sales. The work in [7] combined machine learning with a simple technique based on counting the
positive and negative words in a review. Movie gross prediction through news analysis was proposed in [8].
The studies in [9], [10] claimed that the performance of baseline machine learning algorithms for text
classification varies greatly with the choice of model variant and features. García-Cumbreras et al. [10]
proposed an approach to improve collaborative filtering using sentiment analysis. Generative and
discriminative models were combined for sentiment prediction in [11]. An improved, clustering-based sentiment
analysis of online movie reviews was presented in [12]. A hybrid convolutional neural network (CNN)-long
short-term memory (LSTM) approach to sentiment analysis was proposed in [13], [14], and an attention-based
LSTM model, ‘Senti_ALSTM’, in [15]. A predictive model for Bollywood movies was proposed in [16], and a
lexicon and neural network based approach to sentiment analysis in [17]. A movie success rate prediction model
using the random forest method was proposed in [18], a decision tree based approach for rating prediction in
[19], and LSTM-based sentiment analysis of movie reviews in [20]. A news-based recommendation system using a
multinomial Naive Bayes (MNB) classifier was proposed in [21]. In contrast to existing work, the proposed work
uses a hybrid model that combines the results of the prediction model and the Twitter sentiment analysis model
to predict the success of a movie. The rest of the paper is organized as follows. Section 2 discusses the
methods. The results of the proposed work are discussed in section 3. Finally, section 4 concludes our work.


2. METHOD 

The entire work is divided into two components, namely the rating prediction model and the sentiment
analysis model. Figure 1 shows the system design. Initially, a supervised machine learning model is generated
for rating prediction based on the features described in the dataset section. The model is evaluated using
RMSE, MAE, and R2 score. The prediction model has multiple independent features (genre, runtime, budget,
crew’s popularity, and aspect ratio) and one output variable, the rating. Once this model is generated, the
rating of a new movie can be predicted from its input features.

Figure 1. Overall system design

The dataset of text reviews is used to generate the supervised machine learning model for sentiment
analysis [22]. The sentiment analysis model identifies the positive and negative sentiments of the movie
reviews, and a sentiment score is computed. These two models are then used to predict the rating and the
sentiment score of a new movie concurrently, and a weighted sum of the two values is computed. Experiments
were conducted for different weight combinations, and the results show that the best results are obtained
when a weight of 0.6 is assigned to the rating prediction and a weight of 0.4 to the sentiment score. This
hybrid rating is the final output (the predicted movie success rating) of the system. Supervised learning is
used for both rating prediction and sentiment analysis. Figure 2 shows the overall supervised learning
architecture used for both components, which are discussed in detail in the subsequent sections.


2.1. Rating prediction 
Each stage shown in Figure 2 is described below in the context of the rating prediction model:

- Dataset: Part of the IMDb data is available for non-commercial use. The dataset is downloaded from the
official IMDb website [23] and contains more than one million entries. The files ‘title.akas.tsv’ and
‘title.basics.tsv’ contain the information about movie titles. The files ‘title.crew.tsv’ and
‘title.principals.tsv’ contain the information about the people involved (directors, actors, and actresses).
The data in these files are mostly nominal. The file ‘title.ratings.tsv’ holds the most important
information: the rating and the number of votes each title has received.

- Pre-processing and transformation: The prediction of the rating is based on movie features such as genre,
runtime, budget, crew’s popularity, and aspect ratio. The IMDb data is in raw form and needs further
pre-processing. The data is spread across multiple .tsv files, so it is converted to csv format and merged
into a single file with a Python script. For experimental purposes, only movies released in English are
selected and used for model generation and evaluation. The data was thoroughly checked for illegal
characters. Missing values are replaced by the mean value for numerical data and by the most frequent value
for categorical data. Since the models can handle only numerical data, the categorical and ordinal features
are converted to numerical features using data transformation methods. Figure 3 shows the distributions of
movie runtime, rating, budget, and vote count.

- Splitting training and testing data: The processed data is divided into two subsets, a training dataset
and a testing dataset. The training dataset is used to fit the model, and the testing dataset is used to
evaluate the prediction accuracy. The data is divided in a ratio of 80:20 for training and testing. This
ratio is chosen intuitively, based on the Pareto principle (80/20 rule).

- Model training: The aim of training is to fit the model using the training data. The data contains
multiple features and one target variable (rating). The features and the target variable are separated, and
various supervised learning models are applied. The model is trained using the random forest, simple
regression tree, and linear regression algorithms.

- A simple regression tree is a non-parametric supervised learning method that predicts the target by
applying simple decision rules. It partitions the dataset into subsets and fits a simple model to each
subset. However, a single tree model is mostly unstable. Random forest is a supervised learning algorithm
whose estimator fits a number of decision trees on various subsets of the dataset. To improve prediction
accuracy and control overfitting, random forest averages their predictions. Linear regression is also a
supervised learning algorithm; it predicts the value of the dependent variable from the given independent
features. The regression equation used is shown in (1).

Figure 2. Machine learning model

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon$ (1)


where, for observations $i = 1, \dots, n$, $y_i$ is the dependent variable, $x_{i1}, \dots, x_{ip}$ are the
explanatory variables, $\beta_0$ is the y-intercept (constant), $\beta_1, \dots, \beta_p$ are the slope
coefficients of the explanatory variables, and $\epsilon$ is the error term of the model (also known as the
residual).

Figure 3. Density distribution


- Testing and validation: After training, the prediction model is tested on the remaining testing data and
its performance is evaluated. In the validation step, the accuracy of the results is verified by comparing
the test data with the predictions.
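
The splitting, training, and testing stages above can be sketched end to end in a few lines. The sketch below is only an illustration under simplifying assumptions: it uses synthetic data, a single explanatory variable (p = 1 in (1)), and the standard library instead of sklearn. The helper names `split_train_test`, `fit_simple_linear_regression`, and `evaluate` are hypothetical and do not come from the original implementation.

```python
import math
import random


def split_train_test(rows, test_ratio=0.2, seed=0):
    """Shuffle the rows and split them into training and testing subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def evaluate(y_true, y_pred):
    """Return MAE, RMSE and R2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Synthetic (made-up) data: the rating rises with one popularity-like feature.
rng = random.Random(42)
data = [(x, 4.0 + 0.05 * x + rng.gauss(0, 0.3)) for x in range(100)]

# 80:20 split, fit on the training part, evaluate on the testing part.
train, test = split_train_test(data, test_ratio=0.2, seed=1)
b0, b1 = fit_simple_linear_regression([x for x, _ in train], [y for _, y in train])
predictions = [b0 + b1 * x for x, _ in test]
mae, rmse, r2 = evaluate([y for _, y in test], predictions)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

With several features, as in the actual dataset, the closed-form fit above would have to be replaced by a multivariate least-squares solver; the evaluation function stays the same.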


2.2. Sentiment analysis


Each stage shown in Figure 2 is discussed below in the context of the sentiment analysis model:

- Dataset: The sentiment analysis dataset is downloaded from [22]. It contains IMDb movie reviews along
with their sentiment polarity labels, i.e. negative or positive. The core dataset contains around 50K
reviews, split into a 25K training set and a 25K testing set. The training and testing sets contain
disjoint sets of movies.

- Data pre-processing and transformation: This is an important step in text classification; it transforms
the raw data into a format the learning model can use. The pre-processing steps are tokenization (breaking
a stream of text into words or meaningful tokens) and stemming (reducing inflectional forms). The review
text may contain non-alphabetic characters, stop words, and URLs. During pre-processing of tweets, special
characters, e-mail addresses, URLs, and repetitive characters are removed. Table 1 shows the reviews before
and after pre-processing. A unigram-based bag-of-words (BoW) approach is used to extract the features.

- Splitting training and testing data: The dataset already contains the train and test split, in a ratio
of 50:50.

- Model training: To perform sentiment analysis, linear support vector classifier (SVC), Naive Bayes
classifier, and logistic regression models are used to classify the review text into one of the two
categories, positive and negative.

- Testing and validation: Once the best-fit hyperplane is obtained, predictions for new features can be
calculated. Finally, the performance of the model is evaluated.
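
The cleaning and unigram bag-of-words steps described above can be illustrated with a short standard-library sketch. The regular expressions and the function names `preprocess` and `bag_of_words` are assumptions made for this illustration, not the original code; stemming and stop-word removal are left out to keep the example small.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
REPEAT_RE = re.compile(r"(.)\1{2,}")   # three or more of the same character
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def preprocess(text):
    """Lower-case the text and strip URLs, e-mails, repeats and non-letters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)   # "sooooo" -> "soo"
    text = NON_ALPHA_RE.sub("", text)
    return " ".join(text.split())         # collapse runs of whitespace


def bag_of_words(text):
    """Unigram bag of words: token -> number of occurrences."""
    return Counter(preprocess(text).split())


print(preprocess("Why does this movie fall WELL below standards?"))
print(bag_of_words("Good movie, good cast"))
```

Applied to the third review in Table 1, `preprocess` yields the same lower-cased, punctuation-free text shown in the table's second column. In a full pipeline the resulting counts would be turned into a fixed-length vector over the training vocabulary before being passed to the classifier.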

This model is now used to classify tweets related to a particular movie. Figure 4 shows the overall
process. Tweets are collected from Twitter using the public application programming interface (API) provided
by Twitter. Twitter provides authentication keys so that tweets can be extracted through authenticated
requests. The consumer key, consumer secret, access token, and access token secret are the unique keys
required to fetch the tweets. The retrieved tweets contain information such as user ID, date of tweet,
retweet count, tweet ID, and so on. All the tweets, news, and comments related to a particular movie are
fetched.

Next, the sentiment analysis model is applied to the collected tweets. First, various parameters such as
keywords, language, and node entities are set. Next, the user credentials are set. Once the credentials are
in place, the polarity of each tweet is identified using the sentiment analysis model generated above. The
score of a movie is decided on the basis of the polarity: if the polarity is positive, 1 is added to the
score; otherwise, 1 is subtracted from the score. The computed scores are used in the hybrid rating model.
These scores may also be used to find the most popular movies of a particular calendar year.
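
The scoring rule described above is simple enough to state directly in code. The following is a minimal sketch that assumes the classifier output is available as a list of "positive"/"negative" labels per movie; the movie titles, labels, and the function names `movie_score` and `rank_movies` are made up for illustration.

```python
def movie_score(polarities):
    """Add 1 for every positive tweet and subtract 1 for every other tweet."""
    score = 0
    for polarity in polarities:
        score += 1 if polarity == "positive" else -1
    return score


def rank_movies(polarities_by_movie):
    """Return (title, score) pairs sorted by score, highest first."""
    scores = {title: movie_score(p) for title, p in polarities_by_movie.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical classifier output for two made-up movies.
predicted = {
    "Movie A": ["positive", "positive", "negative", "positive"],
    "Movie B": ["negative", "positive", "negative"],
}
ranking = rank_movies(predicted)
print(ranking)
```

Sorting the scores in this way is one possible route to the "most popular movies of a year" list mentioned above.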


Table 1. Reviews before and after pre-processing
Original reviews                                    Reviews after pre-processing
When I first tuned in on this morning news, I...    when i first tuned in on this morning news it...
Mere thoughts of "Going Overboard" (aka "Babes...   mere thoughts of going overboard aka babes aho...
Why does this movie fall WELL below standards?...   why does this movie fall well below standards...
Wow and I thought that any Steven Segal movie...    wow and i thought that any steven segal movie...
The story is seen before, but that does'n matt...   the story is seen before but that doesand matt...

Figure 4. Sentiment analysis process


2.3. Hybrid rating 

The rating obtained from the rating prediction model and the sentiment analysis score of the movie (as
discussed in the two sections above) are combined to obtain the overall rating of the movie. This rating
therefore takes into account both the movie features and the social media sentiment.
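
The combination step can be sketched as the weighted sum described in Section 2, using the reported weights of 0.6 for the predicted rating and 0.4 for the sentiment score. The text does not state how the tweet-based score is brought onto the rating scale, so the linear rescaling to the range 0 to 10 in `scale_sentiment` is purely an assumption of this sketch, as are the function names and the input values.

```python
def scale_sentiment(score, n_tweets, low=0.0, high=10.0):
    """Map a tweet score in [-n_tweets, n_tweets] onto [low, high].

    Assumption of this sketch: the mapping is linear; the original scaling
    is not specified.
    """
    if n_tweets == 0:
        return (low + high) / 2.0   # no tweets: fall back to the midpoint
    fraction = (score + n_tweets) / (2.0 * n_tweets)   # 0.0 .. 1.0
    return low + fraction * (high - low)


def hybrid_rating(predicted_rating, sentiment_rating, w_rating=0.6, w_sentiment=0.4):
    """Weighted sum of the predicted rating and the rescaled sentiment score."""
    return w_rating * predicted_rating + w_sentiment * sentiment_rating


# Made-up inputs: predicted rating 7.2; 30 positive and 10 negative tweets.
sentiment = scale_sentiment(score=20, n_tweets=40)
final_rating = hybrid_rating(7.2, sentiment)
print(round(final_rating, 2))
```

Keeping the weights as parameters makes it easy to repeat the weight-combination experiment mentioned in Section 2 with other values.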


2.4. Implementation 
Python APIs are used to implement the prediction model discussed above. The important libraries 
used are [24]-[27]. 
- Data pre-processing: The pandas library is used to create dataframes for data processing and
transformation.
- Splitting the training and testing dataset: The train_test_split() function of sklearn is used to split
the dataset.
- Training the model: The random forest regressor, decision tree regressor, linear regression, linear SVC,
Naive Bayes, and logistic regression models of sklearn are used for training.
- Testing: Testing and evaluation are done using sklearn.metrics.
- Plotting: Matplotlib is used for plotting the data and results, and plotly is mainly used for displaying
graphs.
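
As an illustration of the pre-processing step listed above, the missing-value rule from Section 2.1 (mean for numerical data, most frequent value for categorical data) can be sketched with the standard library alone. The function name `impute_column` and the sample columns are hypothetical; in the setup described here the same rule would be applied to pandas dataframes instead.

```python
from collections import Counter
from statistics import mean


def impute_column(values):
    """Fill None entries: mean for numeric columns, most frequent value otherwise."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)            # nothing to infer a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Made-up columns; None marks a missing value.
runtime = [90, None, 120, 150]
genre = ["Drama", "Comedy", None, "Drama"]

print(impute_column(runtime))
print(impute_column(genre))
```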
For sentiment analysis of the collected tweets, the TwitterSearch package is used. First, the
TwitterSearchOrder class of the TwitterSearch package is used to set parameters such as keywords, language,
and node entities. Next, the TwitterSearch object is created, in which user credentials such as the key and
secret_access_token are set. The algorithm then iterates through the tweets and analyses them by checking
the sentiment polarity of each tweet using TextBlob.


3. RESULTS AND EVALUATION

This section discusses the results obtained. The MAE, RMSE, and R2 of the various models are shown in
Table 2. In terms of MAE, the random forest algorithm performs worst and linear regression performs best.
The linear regression model was also found to be very stable in predicting values. Therefore, the linear
regression model is taken as the model for rating prediction. Figure 5 shows the actual and predicted
ratings of some of the testing samples for the linear regression model. The predicted value is very close to
the real value in most cases.
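
For reference, the three measures reported in Table 2 are commonly defined as below, where $y_i$ is the actual rating, $\hat{y}_i$ the predicted rating, $\bar{y}$ the mean of the actual ratings, and $n$ the number of test samples. These are the standard textbook definitions and are assumed, rather than confirmed, to be the ones used in the experiments.

```latex
\mathrm{MAE}  = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\qquad
\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }
\qquad
R^2 = 1 - \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }
               { \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 }
```

Lower MAE and RMSE and a higher R2 indicate a better fit, which is the sense in which the models in Table 2 are compared.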

The accuracy of the sentiment analysis model is shown in Table 3. Linear SVC has the highest accuracy of
the three models, 88.47%, and is therefore used for the hybrid model. The accuracy of the hybrid model is
higher still than that of the individual models. The top ten movies of a particular year are identified
using the proposed model, taking the Twitter sentiment analysis into account. The top movies identified for
the year 2017 are shown in Figure 6. The actual top movies of 2017 based on user votes are Blade Runner,
Thor: Ragnarok, Call Me By Your Name, Wind River, and The Greatest Showman. This shows that the predicted
results are close to the actual results.


Table 2. Comparison of MAE and RMSE for different models
Model Used               MAE     RMSE    R2
Random Forest            0.491   0.511   0.712
Simple Regression Tree   0.452   0.699   0.601
Linear Regression        0.392   0.501   0.718

Figure 5. Actual vs predicted ratings


Table 3. Performance comparison of sentiment analysis
Model                 Precision   F1 Score   Accuracy
Logistic Regression   0.81        0.84       0.8480
Linear SVC            0.84        0.88       0.8847
Naive Bayes           0.80        0.83       0.838

Figure 6. Top 10 movies of year 2017


4. CONCLUSION

The success of a movie depends not only on the features of the movie but is also influenced by social
media reviews and comments. In this work, a hybrid approach is used to predict the success rate of a movie.
The hybrid approach considers both the movie features and the sentiment expressed in movie reviews. For
rating prediction, random forest, simple regression tree, and linear regression models are generated and
compared. Based on the performance of the models in terms of RMSE, MAE, and R2 score, the linear regression
model is selected for the proposed movie success prediction approach. For review sentiment analysis, linear
SVC, Naive Bayes, and logistic regression models are compared in terms of accuracy, precision, and F1 score.
Linear SVC performed best among the three models and is therefore selected for the proposed success rate
prediction approach. Finally, the success rate of a new movie is predicted by combining the two models and
analysing the related tweets scraped from Twitter. In future, other machine learning models may be
implemented and tested for movie success rate prediction.